The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
Loading required package: IRanges
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'Biobase'
The following object is masked from 'package:MatrixGenerics':
rowMedians
The following objects are masked from 'package:matrixStats':
anyMissing, rowMedians
library(SummarizedExperiment)
species <-"Homo sapiens"# Variable d'annotation à visualiser (colData(airway))plot_annotations <-"dex"# traitement au dexaméthasone# Paramètres de qualité (QC)qc_min_nsamp <-2qc_min_gene_counts <-10# valeur un peu plus élevée pour un jeu réel# Clustering d'expressionexp_cluster <-data.frame(k =2)# Clustering des métadonnéesmetadata_clusters <-list(pathway_scores =data.frame(k =2),microenv_scores =data.frame(k =3))# Collections de pathways (humaines)pathway_collections <-"CP:KEGG_LEGACY"# KEGG tout court est plus standard pour Homo sapiens# Gènes pertinents dans airway pour les heatmaps (liés au métabolisme, réponse au stress, etc.)heatmap_genes <-list(c("NR3C1", "FKBP5", "TSC22D3", "ZBTB16", "PER1"), # gènes régulés par le glucocorticoïdec("CYP1B1", "G6PD", "HMOX1", "NQO1", "SOD2") # stress oxydatif / métabolisme)# Exemples de pathways (à adapter selon les résultats d'analyse)heatmap_pathways <-c("KEGG_APOPTOSIS","KEGG_GLUTATHIONE_METABOLISM")# Gènes pour les boxplots (fortement exprimés et différentiels)boxplot_genes <-c("FKBP5", "TSC22D3")# Pathways pour les boxplotsboxplot_pathways <-c("KEGG_APOPTOSIS","KEGG_GLUTATHIONE_METABOLISM")# Corrélations entre gènescorrelation_genes <-list(c("FKBP5", "TSC22D3"),c("FKBP5", "ZBTB16"))# Corrélations entre pathwayscorrelation_pathways <-list(c("KEGG_APOPTOSIS", "KEGG_GLUTATHIONE_METABOLISM"),c("KEGG_APOPTOSIS", "KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY"))
library(CAIBIrnaseq)
Load data
This section loads the RNA-seq dataset for analysis. It ensures the correct input file is used, as specified in the parameters. rebase_gexp
data(airway, package="airway")exp_data <- airwayrowData(exp_data)$gene_length_kb <- (rowData(exp_data)$gene_seq_end -rowData(exp_data)$gene_seq_start) /1000library(biomaRt)# Initialiser biomaRt pour Ensembl humainmart <-useMart("ensembl", dataset ="hsapiens_gene_ensembl")# Obtenir les descriptions basées sur les gene_idgene_ids <-rowData(exp_data)$gene_idannot <-getBM(attributes =c("ensembl_gene_id", "description"),filters ="ensembl_gene_id",values = gene_ids,mart = mart)# Faire correspondre les descriptions aux lignes de rowDatamatched <-match(rowData(exp_data)$gene_id, annot$ensembl_gene_id)rowData(exp_data)$gene_description <- annot$description[matched]
Pre-processing
Most datasets use ensemble gene ID by default after alignment, so this step rebases the expression data to gene names. This ensures consistency in naming for downstream analyses.
-- Rebasing the gene expression matrix using `gene_name` as main annotation
Filter
Here, we filter out genes expressed in too few samples or with very low counts. This removes noise from the data and focuses on meaningful gene expressions.
- Keeping 33026/56638 genes found in at least 1 sample(s) with at least 1 counts
diffexp <- diffExpAnalysis(countData = SummarizedExperiment::assays(exp_data)$counts, sampleInfo = SummarizedExperiment::colData(exp_data), method = diffexpMethod, cutoff = 10, design, coefname) Visualization of the filtering process to ensure the criteria applied align with the dataset’s characteristics:
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_text_repel()`).
Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
If you'd like to see this geom implemented,
Please open an issue with your example code at
https://github.com/ropensci/plotly/issues
Normalize
Here, we apply a normalization to the expression data, making samples comparable by reducing variability due to technical differences. For datasets with few samples, rlog is the preferred normalization and when more samples are present, vst is applied. class(exp_data)
exp_data <-normalize_gexp(exp_data)
- Less than 30 samples -> Performing `rlog` normalization...
PCA
Principal component analysis (PCA) identifies the major patterns in the dataset. These patterns help explore similarities or differences among samples based on gene expression.
pca_res =pca_gexp(exp_data)exp_data@metadata[["pca_res"]] <- pca_resannotations <-setdiff(plot_annotations, c("exp_cluster", "path_cluster"))plot_pca(exp_data, color = plot_annotations)
Saving 7 x 5 in image
library(factoextra)
Loading required package: ggplot2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
groups <- SummarizedExperiment::colData(exp_data)$dex # ou le nom réel de la colonne indiquant STING_KO/Wild_typefviz_pca_ind(pca_res,geom ="point",habillage = groups,palette =c("#00AFBB", "#E7B800"), # couleurs personnaliséesaddEllipses =TRUE,ellipse.type ="confidence",repel =TRUE,label ="none")
Unsupervised clustering
Here, we group samples based on expression patterns without prior knowledge using hierarchical clustering on either a selected gene list from the parameters or, by default, the 2000 most highly expressed genes.
This can be useful for discovering sample subgroups or new biological insights.
exp_data <-cluster_exp(exp_data, k = exp_cluster$k, genes = exp_cluster$genes, n_pcs =3)
Clustering based on the top 2000 highly variable genes.
-- Reducing dimensionality using PCA
-- Performing hierarchical clustering into 2 groups
Visual representation of expression levels for HVG across clusters, highlighting distinct patterns.
-- Saving heatmap at results/clustering/heatmap_2000hvg_exp_cluster.pdf
Pathway activity
Pathway analysis enables us to understand the functional implications of gene expression changes. Here, we analyze the dataset for pathway activity using two methods.
PROGENy
PROGENy is a collection of only 14 core pathway responsive genes from large signaling perturbation experiments. For more information see the original paper.
progeny_scores <-score_progeny(exp_data, species ="Homo sapiens")tmp <- S4Vectors::metadata(exp_data) # récupère une liste depuis exp_datatmp[["progeny_scores"]] <- progeny_scores # modifie cette listeS4Vectors::metadata(exp_data) <- tmp plot_progeny_heatmap(exp_data, annotations = plot_annotations,fname ="results/pathways/hm_progeny_scores.pdf")
Warning in prep_scores_hm(exp_data, progeny_scores): 'sample_id' already exists
in colData and will be overwritten.
-- Saving heatmap at results/pathways/hm_progeny_scores.pdf
Pathway collections available in the MSIGdb can be specified in the parameters. These pathways are scored and ranked by their variance in the data. These are the available collections (use gs_subcat as name except for Hallmarks, which should be ‘H’).
This step calculates immune and stromal cell type abundances using MCPcounter or mMCPcounter. It helps to infer the composition of the tumor microenvironment or similar contexts.
This section visualizes relationships between pairs of genes or pathways by plotting their expression/activity correlations. Correlation analysis can reveal important co-regulation or interaction patterns, helping to uncover biologically meaningful relationships.
Selected genes
Here we plot the correlation between selected gene pairs across the dataset. Each pair is plotted separately, and color-coded by sample annotation.
Correlation plots for selected pathways can help identify similarities or differences in pathway activity patterns across samples. Each pathway pair is plotted separately and color-coded by sample annotation to illustrate trends within each condition.